2021-03-30
This report provides an analysis of the Grown Strong March 2021 survey results, as well as a demonstration of some novel methods that Grown Strong may be interested in leveraging as they continue to scale.
The report has been compiled by Grown Strong in collaboration with TwoKai, an Artificial Intelligence consultancy. This report was created using Python and Jupyter Notebook: Python is the programming language, and Jupyter Notebook is an environment in which Python code can be written. Jupyter Notebooks can also produce presentation-friendly reports such as this one without displaying all the code required to produce them.
This section provides a brief overview of the March 2021 survey, and provides an analysis of the high-level results of the survey. It looks particularly at the performance of the survey in terms of unique respondents, missingness and data quality, as well as summarising the dimensions and variables of the dataset returned from the survey.
The main purpose of the March 2021 survey was to retrospectively collect data on Grown Strong's customers. Particular emphasis was placed on understanding the customers, including their demographics and their thoughts and opinions on Grown Strong's products.
It is important to note that, while surveys can return a rich sample of data, that sample does not necessarily reflect the true population of customers. If a survey is completely optional and offers no significant incentive (e.g. monetary prizes), respondents will often be those who feel strongly one way or another. This can make the data inherently polarised, with highly satisfied and very dissatisfied customers responding most.
For this reason, TwoKai's recommendation is to use the following analysis of the March 2021 survey as a springboard for further questioning and ideation, and consequently for further and more precise data collection (with the intention of getting a larger sample, with the desired variables, that better reflects the true population).
Figure 1 shows that there were 147 unique respondents and 147 records in the data. There were therefore 0 duplicated respondents (uniqueness is based on the name and email address variables).
In the survey, 23 different questions were asked in total. As shown in Figure 1, this resulted in 32 columns in the dataset - due to questions that allowed an "all that apply" selection answer. It is worth noting that these types of questions in future data collection can result in highly dimensional data with numerous columns (which make data cleaning and modelling more complex).
| Statistic | Value |
|---|---|
| Number of unique respondents | 147 |
| Number of rows in dataset | 147 |
| Number of questions asked in survey | 23 |
| Number of columns in dataset | 32 |
Figure 1: Table of summary statistics for the March 2021 Grown Strong customer survey.
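The uniqueness check behind Figure 1 can be sketched in plain Python. This is a minimal illustration and not the code used to produce this report: the field names `name` and `email_address` follow the cleaned variable names, while the helper `summarise_respondents` and the example records are hypothetical.

```python
def summarise_respondents(records, key_fields=("name", "email_address")):
    """Count rows, unique respondents and duplicated respondents."""
    seen = set()
    duplicates = 0
    for record in records:
        # Normalise case and whitespace so trivially different entries match.
        key = tuple(str(record[field]).strip().lower() for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": len(records),
        "unique_respondents": len(seen),
        "duplicated_respondents": duplicates,
    }


# Hypothetical records for illustration only; not taken from the survey.
example_records = [
    {"name": "Ada Example", "email_address": "ada@example.com"},
    {"name": "Bea Example", "email_address": "bea@example.com"},
    {"name": "ada example ", "email_address": "ADA@example.com"},
]
summary = summarise_respondents(example_records)
print(summary)
```

Normalising case and whitespace before comparing is a design choice: it avoids counting the same person twice when an email address is typed with different capitalisation.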
Figure 2 shows that, after cleaning the data, there were 18 categorical variables returned and 14 numerical/boolean variables. Of the 18 categorical variables, 5 were optional free text fields in which customers could write freely and offer thoughts, opinions and suggestions (as opposed to mandatory free text fields). These variables in particular are analysed in later sections, as they offer some of the most critical insights into Grown Strong's customers.
Figure 2: Frequency of categorical and numerical variables returned from the survey.
The first 5 rows of the cleaned survey results dataset, with name and email address variables removed for anonymisation, are displayed in Figure 3 for manual examination.
| | gender | ethnicity | home_location | household_income | num_work_week_hrs | has_children | fitness_level | nutrition_level | olympic_lifting_experience | goal_lose_fat | goal_gain_muscle | goal_maintain_fitness | goal_gain_weight | goal_competition_ready | goal_improve_crossfit | goal_get_stronger | goal_gain_confidence | goal_other | most_used_gs_program | num_gs_sessions_per_week | workout_location | uses_other_workouts | uses_other_workouts_further_info | gs_provision_suggestion | has_joined_facebook_group | gs_improvement_suggestion0 | why_not_joined_facebook_group | gs_likely_recommendation | gs_improvement_suggestion1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f | african-american | other | less than 25,000 | NaN | 1.0 | 0 | 1 | I do it every once in a while | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | GS20 | 2-3 | crossfit gym | 0.0 | NaN | NaN | 0.0 | NaN | NaN | 1.0 | gh |
| 1 | f | caucasian | north america/central america | 25,000 - 50,000 | NaN | 1.0 | 7 | 8 | I do it regularly | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | GS60+ | 5+ | garage or outside house | 0.0 | NaN | NaN | 1.0 | NaN | NaN | 10.0 | love the program |
| 2 | f | asian | north america/central america | 50,000 - 100,000 | 26-50 | 1.0 | 7 | 5 | I used to do it, but not anymore | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | GS60 | 1-2 | garage or outside house | 1.0 | i primarily use street parking and will supple... | NaN | 1.0 | no i enjoy the girls and everyones encouragement | NaN | 10.0 | NaN |
| 3 | f | caucasian | north america/central america | less than 25,000 | 26-50 | 0.0 | 7 | 6 | I do it regularly | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | NaN | GSOLY | 5+ | inside house | 0.0 | NaN | NaN | 1.0 | NaN | NaN | 10.0 | it is amazing |
| 4 | f | caucasian | north america/central america | 50,000 - 100,000 | 26-50 | 1.0 | 2 | 2 | I used to do it, but not anymore | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | NaN | GS30 | 0-1 | garage or outside house | 0.0 | NaN | NaN | 1.0 | NaN | NaN | 10.0 | NaN |
Figure 3: First 5 rows of the cleaned survey results dataset.
Cleaned variables with no missing values were goal_get_stronger, workout_location, goal_gain_confidence, name, goal_improve_crossfit, email_address, goal_gain_weight, goal_maintain_fitness, goal_gain_muscle, goal_lose_fat, olympic_lifting_experience, nutrition_level, fitness_level, has_children, home_location, age and goal_competition_ready.
The cleaned variable with the most missing values was why_not_joined_facebook_group with 95.2% of values missing.
Figure 4 shows the percentage of missing values for all cleaned variables. The optional free text field variables had the fewest responses.
Figure 4: Percentage of missing values for each variable arranged in order of ascending missingness percentage.
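The missingness calculation summarised in Figure 4 can be sketched as follows. This is an illustrative outline under stated assumptions rather than the report's own code: the helpers `is_missing` and `missing_percentages` are hypothetical, the example rows are invented, and treating `None`, `NaN` and empty strings as missing is an assumption about how blanks are encoded.

```python
import math


def is_missing(value):
    """Treat None, NaN and empty strings as missing (an assumption)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def missing_percentages(records):
    """Return (variable, percent_missing) pairs in ascending order of missingness."""
    if not records:
        return []
    total = len(records)
    result = []
    for column in records[0].keys():
        n_missing = sum(1 for row in records if is_missing(row.get(column)))
        result.append((column, round(100 * n_missing / total, 1)))
    return sorted(result, key=lambda pair: pair[1])


# Hypothetical rows for illustration only; not taken from the survey.
example_records = [
    {"fitness_level": 7, "num_work_week_hrs": "26-50", "why_not_joined_facebook_group": None},
    {"fitness_level": 2, "num_work_week_hrs": None, "why_not_joined_facebook_group": None},
    {"fitness_level": 0, "num_work_week_hrs": "26-50", "why_not_joined_facebook_group": "no time"},
    {"fitness_level": 7, "num_work_week_hrs": "26-50", "why_not_joined_facebook_group": float("nan")},
]
percentages = missing_percentages(example_records)
print(percentages)
```

With 147 rows, a missingness of 95.2% for why_not_joined_facebook_group is consistent with 140 empty values (140 / 147 ≈ 0.952).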
This section analyses the data and its cleaned variables in greater detail in order to provide a better description of the customers that responded to the survey. Variables are presented in no particular thematic order, although variables with fewer missing values are presented first.
Despite the reasonably low number of survey responses, the distribution of ages appears approximately normal, as would be expected for the products that Grown Strong provide. There is a right skew to the distribution, indicating a longer tail towards the older age-groups; there is also quite a large disparity between age-groups 18-24 and 25-34.
Figure 5 shows a histogram of age-group. The age-group with the most respondents was 25-34 with 75 respondents in total. Conversely, the age-group with the fewest respondents was 55-64 with 2 respondents in total.
The clear disparity between age-groups 18-24 and 25-34 is worth noting. It could indicate that most of the respondents in the 25-34 age-group are at the higher end of this bin, as reflected by the right skew of the distribution. Further probing here could be valuable, and if collection of customer age data is pursued in future, it would be useful to collect dates of birth to get a fully granular picture. If most customers in the 25-34 age-group are in fact at the higher end of this bin, it could be due to lower ages being priced out among other reasons. More granular data would be required to explore this hypothesis further.
Figure 5: Number of respondents by age-group, in order of ascending age-group.
Figure 6 displays the number of respondents split by gender. Female respondents were highest in number by a large margin, with 142 respondents, while there were only 3 male respondents.
Figure 6: Number of respondents by gender.
Caucasian respondents were most numerous in the survey, with 97 responses; this was a clear majority when compared with other ethnicity options. For example, there was only 1 response from African-American respondents. Figure 7 shows the number of respondents for all ethnicity options used in the survey.
Data on ethnicity are notoriously difficult to collect and analyse due to their sensitivity. Often, customers feel uncomfortable having these data recorded, and it can be difficult to navigate the multitude of options that must be included for effective ethnicity data collection. Furthermore, using ethnicity data in statistical modelling methods like machine learning is controversial at best, as it opens possibilities for racial biases to be unintentionally captured in algorithms. Customer segmentation based on ethnicity, even if used alongside other variables, can also pose ethical issues in terms of generalisation. Unless Grown Strong have ethical reasons to attempt to target specific ethnicities (for example, using ethnicity to remove racial biases from algorithms), TwoKai recommends using these data with extreme caution.
Figure 7: Number of respondents by ethnicity.
The Home Location that respondents selected most was North America/Central America with 125 respondents. This was a clear majority (the Home Location with fewest respondents was Asia with only 1 respondent).
Given the number of respondents selecting North America/Central America as their Home Location, it may be prudent to increase granularity for this area in future data collection, by splitting the option into more specific locations. Figure 8 below shows the number of respondents by all Home Locations.
Figure 8: Number of respondents by home location.
Figure 9 shows a histogram of number of respondents by household income. While initially the distribution appears normal (albeit with a left skew), TwoKai would recommend exercising caution when drawing insights from these responses, until more data has been collected. Firstly, the income bins are not consistent in size, so some income groups could appear more frequent merely by capturing a greater range of respondents. Secondly, bins are overlapping as opposed to being distinct, so respondents could be making inconsistent selections (some respondents could select the lower income group while others select the upper income group). Further data collection on this variable is highly recommended as accurate income data can be extremely statistically powerful.
Figure 9: Number of respondents by household income group.
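The two concerns raised about the income groups (inconsistent bin sizes and overlapping boundaries) can be checked mechanically. The sketch below is illustrative only: it uses the three income labels visible in Figure 3, the numeric bounds are an interpretation of those labels with inclusive endpoints, and the helper `check_bins` is hypothetical. It compares adjacent bins only.

```python
def check_bins(bins):
    """Report boundary overlaps and widths for (label, low, high) bins.

    Bounds are treated as inclusive. Only adjacent bins (after sorting by
    lower bound) are compared.
    """
    ordered = sorted(bins, key=lambda b: b[1])
    overlaps = []
    for (label_a, _, high_a), (label_b, low_b, _) in zip(ordered, ordered[1:]):
        if low_b <= high_a:
            # A respondent with income equal to low_b could pick either group.
            overlaps.append((label_a, label_b))
    widths = {label: high - low for label, low, high in ordered}
    return {
        "overlaps": overlaps,
        "widths": widths,
        "consistent_width": len(set(widths.values())) <= 1,
    }


# Bounds are an assumed reading of the survey labels, not survey data.
income_bins = [
    ("less than 25,000", 0, 24_999),
    ("25,000 - 50,000", 25_000, 50_000),
    ("50,000 - 100,000", 50_000, 100_000),
]
report = check_bins(income_bins)
print(report["overlaps"])
print(report["consistent_width"])
```

Under these assumptions, a respondent earning exactly 50,000 could select either of two groups, and the widest group covers roughly twice the range of the others.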
Most respondents belonged to the 26-50 working hours per week group, with 100 respondents selecting this option. Figure 10 shows a histogram of Number of Working Hours per Week for respondents.
It would be prudent to increase granularity for this variable (for example, by adding more groups for customers to select), as it is difficult to draw a great deal of insight from only 3 working-hours groups when conducting univariate analysis. Increased granularity would, for example, make it possible to distinguish between customers who do not work during the week and those who work part-time; this distinction cannot be made using the groups in Figure 10.
Figure 10: Number of respondents by number of working hours per week.
Respondents were asked to rate their fitness level on a scale of 0-10, with higher numbers indicating a higher level of fitness, and vice versa for lower numbers. Figure 11 shows a histogram of respondents' fitness levels, with the most respondents rating themselves at level 7 out of 10 (40 respondents).
It is important to note that this is a rating based on the respondents' personal opinion. So rather than an indicator of actual fitness level, it is a reflection of how a person feels about their fitness level. A modal value of 7 therefore indicates that the most common view among respondents is that they are fitter than the average person, yet still have room for further improvement.
There were 8 respondents who rated themselves as 10 out of 10 for fitness level, indicating that they believe they are fulfilling their maximum fitness potential. With more data collection, it would be very interesting to track these cohorts of customers to observe what products they are using, and understand why they rate themselves at this level.
Figure 11: Number of respondents by fitness level.
Respondents were asked to rate their nutrition level on a scale of 0-10, with higher numbers indicating a better level of nutrition, and vice versa for lower numbers. Figure 12 shows a histogram of respondents' nutrition levels, with the most respondents rating themselves at level 5 out of 10 (33 respondents).
As with Fitness Level above, it is important to note that this is a self-rating, and indicates respondents' personal opinion of their nutrition. The distribution is reasonably spread with respondents rating themselves most frequently between 5 and 8. Again, as with Fitness Level, there were 6 respondents who have rated themselves at 10 out of 10 for nutrition. Collecting data on these individuals going forward would be very worthwhile, as it could show demand for products aimed at more advanced customers who wish to be challenged further.
Figure 12: Number of respondents by nutrition level.
Figure 13 shows that the most frequent response was that respondents do Olympic lifting regularly (46 respondents). When compared with the general population, it is particularly insightful that this was the most frequent response from those surveyed; it indicates that customers have, or believe they have, a very high calibre of fitness ability. Examining this alongside the previous two sections (analysing fitness and nutrition levels), it is possible that a market fit for advanced fitness individuals has been established.
If this is the case (further data collection would be required to form a better hypothesis), then building a community around these like-minded individuals could be a highly beneficial strategy. Conversely though, there could be a risk of excluding customers with very little experience. In either case, continuing to examine the ability of these advanced customers by using different data collection methods (automated product usage data collection, for example) would be very valuable, as these customers would in most settings be considered outliers (and therefore of significant interest).
Figure 13: Number of respondents by olympic lifting experience.
Figure 14 shows that GS60+ was the program respondents most often named as their most used (51 respondents). Conversely, GS20 was named least often (9 respondents).
Figure 14: Number of respondents by most used Grown Strong program.
Figure 15 shows a histogram of the number of Grown Strong sessions respondents participate in per week. The most frequent group for respondents was 4-5 sessions per week (52 respondents). The least frequent group for respondents was 0-1 sessions per week (8 respondents).
Insights from this variable should be drawn with caution, as the groups used are overlapping and respondents could therefore be making inconsistent selections. It is important to collect data at the most granular level for this variable, as it is likely one of the most critical metrics to monitor on an automated basis going forward. This will provide insights on some crucial KPIs, such as retention rate. Many models and automated analytical additions to the product could use these data to great value, for example by creating automated prompts when an individual customer's engagement is decreasing. TwoKai's recommendation would therefore be to consider prioritising automated collection of these data.
Figure 15: Number of respondents by number of Grown Strong sessions per week.
Respondents were asked to select all goal types that they feel apply to them, from a pre-defined list of goals. Figure 16 shows that from this list, the most frequently selected goal was to Get stronger (119 respondents). The least frequently selected goal was to Gain weight (3 respondents).
Figure 16: Number of respondents by goal type selected.
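To illustrate how an "all that apply" question becomes several columns in the dataset, and how the counts in Figure 16 can then be derived, here is a minimal sketch. The `goal_` column prefix mirrors the cleaned variable names; the helpers `expand_multi_select` and `count_selected`, the shortened option list and the example responses are hypothetical.

```python
def expand_multi_select(responses, options, prefix="goal_"):
    """Turn one "all that apply" answer per respondent into 0/1 indicator columns."""
    rows = []
    for selected in responses:
        chosen = set(selected)
        rows.append({prefix + option: int(option in chosen) for option in options})
    return rows


def count_selected(rows):
    """Count how many respondents selected each option, most frequent first."""
    totals = {}
    for row in rows:
        for column, flag in row.items():
            totals[column] = totals.get(column, 0) + flag
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical options and responses for illustration only.
goal_options = ["lose_fat", "gain_muscle", "get_stronger", "gain_weight"]
example_responses = [
    ["lose_fat", "get_stronger"],
    ["get_stronger"],
    ["gain_muscle", "get_stronger", "lose_fat"],
]
indicator_rows = expand_multi_select(example_responses, goal_options)
counts = count_selected(indicator_rows)
print(counts)
```

Each selectable option adds one column, which is why a single multi-select question can noticeably increase the dimensionality of the dataset.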
In this section a KMeans clustering algorithm is run on a set of numerical variables from the survey dataset. KMeans clustering is an effective unsupervised machine learning method that enables clustering on data with multiple variables. It can perform well on smaller datasets (hence choosing it for this task).
Before clustering, the data are scaled and then transformed using a method called Principal Component Analysis (PCA). Scaling makes values comparable between variables (making it easier for the clustering algorithm to learn), while PCA reduces the number of variables whilst retaining as much of the variance in the data as possible for the clustering algorithm. Two so-called "components" are used for this task, so that the performance of the clustering can be visualised on a scatterplot. This is shown in Figure 17. The optimal number of clusters for this dataset was found to be 4.
The reduction of variables by using PCA is a heavily statistical method, so can be quite an abstract concept. The main message of Figure 17 is gleaned through visually assessing how well defined the cluster boundaries are.
Figure 17: PCA Components by Cluster.
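For readers who want a feel for the mechanics, the scaling and clustering steps can be sketched in plain Python. This is a simplified illustration and not the implementation used for this report: it omits the PCA step, uses a basic form of k-means (Lloyd's algorithm) on invented two-variable data with 2 clusters rather than 4, and the helper names are hypothetical. In practice a library such as scikit-learn would typically be used instead.

```python
import math
import random


def standardise(rows):
    """Scale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Alternate cluster assignment and centroid update until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical (fitness_level, nutrition_level) pairs; not survey data.
example = [[8, 7], [9, 8], [8, 9], [2, 3], [1, 2], [2, 1]]
scaled = standardise(example)
labels, centroids = kmeans(scaled, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which respondents end up grouped together.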
The variables used by the algorithm were hand picked based on a mixture of data availability and key markers suggested by Grown Strong. Specifically, the variables used were:
Figure 18 shows a less abstract demonstration of the clustering algorithm's performance by plotting two (out of the 16) variables used in the algorithm. Nutrition level and fitness level are plotted against each other to provide an example, with cluster colours overlaid. In a more dedicated project, post-modelling analysis would be conducted for all variables used in the algorithm, as well as those not used in the algorithm. This was beyond the scope of this report, however.
Yet even using only the two variables shown in Figure 18, we can see the clustering algorithm is performing reasonably well: Cluster 0 appears to be capturing the highest fitness and nutrition level respondents. Cluster 1 is capturing some of the second highest fitness and nutrition level respondents, but appears to put more emphasis on fitness level than nutrition level (indicating likely interactions with other variables to determine the cluster). Cluster 3 appears quite agnostic, and could indicate the "average" customer. It would be worth analysing other variables for this cluster given its even spread across nutrition and fitness level. Finally, cluster 2 appears to capture the lower end customers in terms of fitness and nutrition level.
The application of this kind of algorithm would be to support analysis into areas such as pricing strategy for subscriptions, answering questions such as which subscriptions would apply to which types of customers. In this case, perhaps customers belonging to cluster 0 would suit more advanced subscription packages, and customers in cluster 2 would suit more beginner-friendly subscription packages.
Figure 18: Nutrition level against fitness level by cluster number.
Unsupervised learning used for clustering, as demonstrated briefly above, can be an incredibly powerful methodology to implement into a business model, particularly in terms of understanding customer segments. With a fully automated process and an effective clustering algorithm, features such as automated, bespoke product suggestions can be added to increase customer engagement, whilst significantly reducing time spent manually analysing customers.
TwoKai proposes further collection of data to make sure that samples used to train clustering algorithms are more representative of the true customer population. This can be done using the same survey method as conducted for this report. Automated data collection is strongly recommended going forward, but work can still be carried out using survey results. As an intuitive suggestion for the number of responses, more robust results would be gained with ~400-500 respondents in total. The algorithm trained on the survey can then be validated using incoming data collected automatically, and if the clusters allocated to these new customers appear accurate, implementing this algorithm into the products could then be seriously considered.
The above demonstration was a deep dive into unsupervised clustering. When Grown Strong are in a position to pursue this further, a much larger scale project will be required in order to get the best results from modelling before deploying into products and platforms.
Unfortunately, there is not enough data to perform truly valuable sentiment analysis. In this section, however, a pretrained sentiment analysis algorithm is run on one free text field variable for demonstration purposes. This kind of algorithm can be built into pipelines to analyse data such as Trustpilot reviews without having to read each review individually. Properly built, it can be an extremely powerful method of automation.
Figure 19 below shows a table of some example texts taken from the survey results, and the score allocated by the sentiment analysis algorithm.
| | gs_improvement_suggestion1 | sentiment_analysis_score |
|---|---|---|
| 0 | gh | Neutral |
| 1 | love the program | Positive |
| 2 | it is amazing | Positive |
| 3 | better instructions guidance | Positive |
| 4 | there are often inconsistencies between the vi... | Neutral |
| 5 | sometimes i find all the sections of the worko... | Neutral |
| 6 | i love our community and am hoping for more gr... | Compound |
| 7 | love the programming for someone who did not k... | Compound |
| 8 | do daily or weekly challenges that encourage us | Neutral |
| 9 | i would like to see more ways to improve my pu... | Neutral |
Figure 19: Table of free text fields and the sentiment scores allocated by the sentiment analysis algorithm.
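The general shape of such a pipeline can be sketched as below. This is not the pretrained algorithm used for Figure 19: `toy_sentiment_score` is a deliberately simple, hypothetical word-list scorer standing in for a real model, and `label_free_text` and the label thresholds are likewise assumptions made for illustration. The point is only to show how a scorer can be applied to every non-empty free text response automatically.

```python
def toy_sentiment_score(text):
    """Hypothetical stand-in scorer: +1 per positive word, -1 per negative word."""
    positive = {"love", "amazing", "great", "enjoy"}
    negative = {"hate", "confusing", "bad", "boring"}
    words = text.lower().split()
    return sum((word in positive) - (word in negative) for word in words)


def label_free_text(texts, scorer=toy_sentiment_score):
    """Apply a scorer to each non-missing text and map the score to a label."""
    results = []
    for text in texts:
        if text is None or not text.strip():
            continue  # skip respondents who left the optional field blank
        score = scorer(text)
        if score > 0:
            label = "Positive"
        elif score < 0:
            label = "Negative"
        else:
            label = "Neutral"
        results.append((text, label))
    return results


# A few short responses from Figure 19, plus a missing value.
example_texts = ["gh", "love the program", None, "it is amazing"]
results = label_free_text(example_texts)
for text, label in results:
    print(f"{label:8} | {text}")
```

Because the scorer is passed in as an argument, a pretrained model could be swapped in without changing the rest of the pipeline.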